From LLM-as-a-Judge to Human-in-the-Loop: Rethinking Evaluation in RAG and Search
Fernando Rejon Barrera and Daniel Wrigley • Location: TUECHTIG • Back to Haystack EU 2024
Everyone’s using LLMs as judges. In this talk, we’ll explore techniques for LLM-as-a-judge evaluation in Retrieval-Augmented Generation (RAG) systems, where prompts, filters, and retrieval strategies create endless variations.
This raises the question: how do you evaluate the judges? In chess, Elo ratings estimate the relative skill levels of players from their game results, with higher ratings indicating stronger players.
We introduce RAGElo, an Elo-style ranking framework that uses LLMs to compare outputs without needing gold answers, bringing structure to subjective judgments at scale. Then we showcase the integration of RAGElo into the Search Relevance Workbench, released in OpenSearch 3: a human-in-the-loop toolkit that lets you dig deep into search results, compare configurations, and spot issues that metrics miss. Together, these tools balance automation and intuition, helping you build better retrieval and generation systems with confidence.
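To make the Elo idea concrete, here is a minimal sketch of how pairwise verdicts could be turned into a ranking using the standard chess Elo update. It is not RAGElo's actual API or implementation: the function names, the K-factor of 32, the starting rating of 1000, the pipeline names, and the verdicts themselves are all assumptions chosen for illustration. In practice each verdict would come from an LLM judge comparing two answers to the same query.

```python
from collections import defaultdict

K = 32  # update step size; a common chess default, assumed here


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(ratings, a, b, score_a):
    """Apply one comparison. score_a is 1.0 (A wins), 0.5 (tie) or 0.0 (B wins)."""
    delta = K * (score_a - expected_score(ratings[a], ratings[b]))
    ratings[a] += delta
    ratings[b] -= delta  # zero-sum: B loses exactly what A gains


def rank(games, initial=1000.0):
    """Replay all comparisons in order and return (name, rating), best first."""
    ratings = defaultdict(lambda: initial)
    for a, b, score_a in games:
        update(ratings, a, b, score_a)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical verdicts: (pipeline A, pipeline B, score for A).
games = [
    ("hybrid", "bm25", 1.0),
    ("hybrid", "dense", 0.5),
    ("dense", "bm25", 1.0),
    ("hybrid", "bm25", 1.0),
]

for name, rating in rank(games):
    print(f"{name}: {rating:.0f}")
```

Because each update depends on the current ratings, the result is sensitive to the order in which comparisons are replayed. Shuffling the games and averaging over several passes is one common way to reduce that effect.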
Fernando Rejon Barrera
Zeta Alpha
Fernando is the CTO of Zeta Alpha, an Amsterdam-based startup helping High-Tech and R&D Enterprises take their Generative AI projects to production. With a PhD in theoretical physics and a hacker (builder) past, Fernando brings a unique blend of theoretical knowledge and hands-on expertise to his role. He is a generalist with a passion for building innovative solutions and working collaboratively, particularly in the realm of AI.
 
Daniel Wrigley
OpenSource Connections
Daniel is a Search Consultant at OpenSource Connections. He has worked in search since graduating in computational linguistics from Ludwig-Maximilians-University Munich in 2012, where he developed his soft spot for search and natural language processing. His experience as a search consultant paved the way for him to become an O’Reilly author, co-authoring the first German book on Apache Solr. He is an active contributor to open source projects and is one of the maintainers of the Elasticsearch Learning to Rank plugin. Early in 2025 he founded a new meetup in Munich focused on advancements in modern search.